Grammatical Error Correction

Mina Gamil
Sherif Gabr

Problem Statement

The main problem that we will be working on this semester is to detect grammatical errors and mistakes in written English text and identify (tag) them. Then, rewrite sentence with no grammatical mistakes. The goal is to train a model that is able to detect mistakes in spelling, punctuation, and grammar.

Dataset

The datasets used are English text corpuses created synthetically or used by shared tasks like BEA19 and CoNLL2014. They could be formatted in plain English text or in M2 format standardised using the ERRANT framework (more information about M2 format and ERRANT can be found in references below). The following are the datasets used for training:

1. PIE A1 Synthetic Data

Consists of 2 files with same number of lines. One file contains the errorful English sentences, and the other contains the same sentences with no grammatical errors.
Each file contains 8,865,347 lines
The size of a single file is 1.13 GB
First 5 lines of both files:

2. Lang-8 Dataset

Consists of a single file in M2 format with ERRANT
Must be converted to source file (containing the errorful sentences) and target file (containing the corrected sentences)
The file contains 4,015,882 lines with 1,037,561 sentences
The size of the file is around 145 MB
A sample of the data:

3. NUCLE Dataset Release 3.3

Consists of a single file in M2 format standardised using the ERRANT framework
Must be converted to source file (containing the errorful sentences) and target file (containing the corrected sentences)
The file contains 158,784 lines with 57,151 sentences
The size of the file is around 9 MB
A sample of the data:

4. FCE Dataset

Consists of a single train file in M2 format standardised using the ERRANT framework
Must be converted to source file (containing the errorful sentences) and target file (containing the corrected sentences)
The file contains 111,387 lines with 28,350 sentences
The size of the file is around 5 MB
A sample of the data:

5. W&I+Locness Dataset

Consists of many training files in M2 format standardised using the ERRANT framework, with a single file being the concatenation of all.
Must be converted to source file (containing the errorful sentences) and target file (containing the corrected sentences)
The file contains 143,562 lines with 34,308 sentences
The size of the file is around 7 MB
A sample of the data:

Input/Output Examples

The input to the model is an Enlgish sentence that can contain grammatical errors. The output of the model is the English sentence rewritten to omit any grammatical errors it has.

State of the Art Model

Currently, the state-of-the-art model for grammatical error correction is the T5 (Text-to-Text-Transfer-Transformer) model by Google. This model produces state-of-the-art results for all languages except English. It is a transformer encoder-decoder model, and it comes in different sizes with the smallest base model of 600 million parameters and the largest with 13 billion parameters. Their base model was inferior to the current SOTA model, but the large model with 11 billion parameters achieved SOTA results in Czech, German, and Russian. Later with a new T5 XXL model with 11B parameters, they were able to achieve SOTA results on all languages that the model was trained on.

Orignial Model from Literature

We adopted the GECToR model for the GEC problem. The main approach is simplifying the task from sequence generation to sequence tagging. GECToR sequence tagging model architecture is an encoder made up of a pre-trained BERT-like transformer stacked with two linear layers with softmax layers on the top. The two linear layers are responsible for mistake detection and error tagging, respectively.

GECToR develops custom token-level transformations to recover the target text by applying them to the source tokens. The transformations increase the coverage of grammatical errors. The edit space consists of mainly basic transformations and some g-transformations.

The basic transformations perform the most common token-level edit operations, such as keeping the current token unchanged, deleting the current token, appending a new token, or replacing the current token with another token.
The g-transformations perform task-specific operations, such as changing the case of the current token, merging two consecutive tokens, splitting a token into two, changing the singular nouns to plural form or vice versa, and changing regular verbs to irregular form or vice versa.

Training occurs in 3 stages:

Pre-trained on synthetic errorful data (PIE synthetic data). Pretrained transformers used are BERT, RoBERTa, GPT-2, XLNet, and ALBERT.
Fine-tuned on errorful-only corpora (NUCLE, Lang-8, FCE, W&I+Locness datasets)
Fine-tuned on the combination of errorful and error-free parallel corpora (NUCLE, Lang-8, FCE, W&I+Locness datasets)

The two fine-tuning stages are important for model performance. The model is trained by Adam optimizer with default hyperparameters. The stopping criteria was 3 epochs of 10k updates each without improvement. The batch size is set to 256 for the pre-training stage and 128 for fine-tuning stages.
The model was further improved by introducing two inference hyperparameters. The first hyperparameter is a positive confidence bias to the probability of $KEEP tag. The second hyperparameter is a sentence-level minimum error probability threshold for the output of the error detection layer. The results of each stage are shown below.

Proposed Updates

We proposed some updates to the original model using new approaches, which can help some of its shortcomings and produce better result.

Update #1: Added more data

More data was shown to be beneficial for GEC problems, so we augmented the development data with the gold annotations of multiple shared tasks into the fine-tuning stage.

Update #2: Applied spell-checking in preprocessing

We applied heuristic dictionary levenshtein distance spell-checking techniques on data before training model. Our thought process is to decrease edit space, which could help model detect more grammatical errors. However, after testing, the scores were close with the spell-checked data having slightly better accuracy, but it is due to more imbalance in the data.

So, we decided to continue using the normal dataset (without pre-spellchecking)

Update #3: Tried different encoder transformers

With the development of more recent transformers than the ones mentioned in the paper, we wanted to try and test the model on different types of transformers. The transformers tested are the ELECTRA-generator model and the ELECTRA-discriminator model.

The intuition of the ELECTRA discriminator model is to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. The following are the scores for training both models in stage 2:

Based on the generator and discriminator intuition, it was expected for the ELECTRA-discriminator model to perform better and that was the case. So, we decided to train the ELECTRA-discriminator model on all 3 stages.

Update #4: Tried different encoder-decoder transformers

We wondered why not use an encoder-decoder transformer but removing the decoder. The paper hypothesizes that encoders from encoder-decoder transformers (NLG) are less useful for GEC models than transformer-encoder models (NLU), but no experiments were done. Therefore, we decided to test the hypothesis using the encoder from the T5 transformer model, since the T5 produces SOTA results in most languages. The T5 architecture:

The T5 transformer model is a Text-to-Text Transfer Transformer. One major advantage of the T5 transformer is its versatility. It can be used translation, question answering, and classification. For example, when used for question answering, it learns in the following way:

With that said, we trained the T5 encoder-only model on the second stage, and the initial results were promising.

So, we decided to further train the T5 econder-only model on all 3 stages.

Update #5: Tried different optimizers

We decided to try different optimizers, especially with the T5 model where it is recommended to use the Adagrad optimizer. We tested the T5 model and the ELECTRA-discriminator model using the default Adam optimizer, the Adafactor optimizer, and SGD optimizer. We felt that those optimizers are the most probable to produce the best results, and due to the large training times, we decided to stick with these 3 optimizers.

The Adam optimizer was used in the GECToR paper, which has been shown produced good results.
The Adafactor optimizer produced slightly better results than the Adam optimizer on the T5 model, but much worse results on the ELECTRA-discriminator model. Also, we observed that the training took less time on both models.

Eventhough the Adafactor produced better results on the T5 model, we decided to stick with the Adam optimizer for rest of tests to be consistent between models and with the paper's scores.

Update #6: Inference tweaking and tuning

The paper mentions 2 inference tweaks the greatly inlfuence performance:

Confidence Bias: Positive bias to the $KEEP tag
Minimum Error probability: Threshold for the output of the error detection layer

The motive now is that after training the models on the 3 stages, we need the optimal inference tweaks on the output model without overfitting to a specific dataset. We can achieve that by predicting on a shared task then graphing the metric and find the tweaks will results in the maximum score. The following are the resulted tests on the ELECTRA-discriminator modela and T5 model:

The following values are the best scores for different models:

Alongside the tweaks mentioned, the number of iterations also influences performance greatly. So we experimented how much does the number of iterations affect performance.

We can see that ELECTRA model reaches steady performance with the number of interations, so it does not greatly affect the ELECTRA model. However, in the T5, the performance changes greatly when the number of iterations change and there is no clear correlation between them. The steady performance in ELECTRA eases mass adoption and user experience, unlike T5.

Update #7: Tried an ensemble of transformers

In the paper, they achieved better results when using an ensemble of models. So we can do the same while using the ELECTRA discriminator and T5 models with the pretrained models of XLNet and RoBERTa. The main motive behind using an ensemble of models is trying to balance the precision and recall scores. We tried a number of combinations:

ELECTRA + T5
ELECTRA + XLNet
ELECTRA + RoBERTa
T5 + XLNet
T5 + RoBERTa
ELECTRA + T5 + XLNet + RoBERTa

The ensemble models produced much better results than single models.

Results

After applying the updates proposed, we evaluated the models on the CoNLL2014 shared task and the BEA19 shared task.

CoNLL2014 Shared Task:

BEA19 Shared Task:

Technical report

Some technical details related to training:

Python framework used: PyTorch v1.10
Training hardware: PC at Machine Learning Lab and Colab Pro
Training must occur on CUDA compatible devices
Training time: ~4 days for stage 1, ~1 days for stage 2, and ~6 hours for stage 3, so around 6 days training all stages
Number of epochs: 30 for stage 1, 10 for stage 2, and 5 for stage 3
Time per epoch usually around 1-3 hours depending on stage
Training data must be in specific format to train model (use preprocessing script of GECToR)
Prediction data is plain English text
Other python frameworks for running model:
- allennlp v0.8.4
- python-Levenshtein v0.12.1
- transformers v4.11.3
- scikit-learn v0.20.0
- sentencepiece v0.1.95
- overrides v4.1.2
- numpy v1.19.5

Conclusion

Now that we know that NLG discriminator networks are very well suited for this task. We want to train and finetune the largest ELECTRA pretrained discriminator model (ELECTRA-1.75M with 335 Million parameters) as it was shown on other tasks to enhance the score on average by around 5%. We want to also experiment with other discriminator networks other than ELECTRA. Moreover, as we learned from our experiments, ELECTRA is much faster to train when compared to other models. Thus, we suggest increasing the tag vocab size from 5000 to 10000 to increase the coverage of errors as explained by Grammarly.

References

All references used, which includes papers, datasets, and other sources which we took information from.